Fix CHAR/VARCHAR length overflow when writing reconcile intermediate data #2428
moomindani wants to merge 4 commits into
Some data sources (e.g., Teradata) return CHAR(n) values with space
padding via JDBC, resulting in values that exceed the declared column
length. Delta enforces CHAR/VARCHAR length constraints through column
metadata (__CHAR_VARCHAR_TYPE_STRING), causing writes to fail for these
padded values.
Strip all column metadata via col.alias(metadata={}) before writing
intermediate DataFrames to Delta. This removes the constraint that
Delta uses for length enforcement.
Observed with Teradata via Lakehouse Federation but not with Lakebase
(PostgreSQL) via Lakehouse Federation.
Co-authored-by: Isaac
- black reformats list comprehension to single line in test helper
- ruff removes unused StringType import (was used in main, dropped after merge)
Co-authored-by: Isaac
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2428 +/- ##
==========================================
- Coverage 65.78% 65.78% -0.01%
==========================================
Files 98 98
Lines 9237 9242 +5
Branches 992 992
==========================================
+ Hits 6077 6080 +3
- Misses 2984 2986 +2
Partials 176 176
`_write_df_to_delta` is a module-level function and accessed ReconIntermediatePersist._strip_char_varchar_constraints from outside the class, which pylint flags as protected-access. Rename to public since the helper is effectively a utility. Also rename mock_select unused arg to *_cols and fix test fn names.
Co-authored-by: Isaac
✅ 148/148 passed, 5 skipped, 24m59s total
Running from acceptance #4311
m-abulazm left a comment
This looks like a deeper issue with the Teradata JDBC driver:
- https://stackoverflow.com/questions/70596812/how-to-avoid-blank-spaces-while-loading-data-from-teradata-to-databricks
- https://support.teradata.com/community?id=community_question&sys_id=ca9847a71b97fb00682ca8233a4bcb41
- https://teradata-docs.s3.amazonaws.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_5.html#BGBJECGD
Anyway, the better fix right now would be to trim variable-length columns in databricks/labs/lakebridge/reconcile/query_builder/expression_generator.py:250 for Databricks, which will apply to Teradata foreign catalogs.
Also, please investigate the current query that gets built. It should already have trim, as it is specified as the universal transformation.
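The trimming idea in this comment can be sketched in plain Python. This is only an illustration of wrapping character columns in TRIM when a query is built; it is not the project's expression generator, and `build_trimmed_select`, the table name, and the column names are all hypothetical.

```python
def build_trimmed_select(table, columns, char_columns):
    """Build a SELECT that wraps character columns in TRIM (hypothetical helper)."""
    select_list = [
        # Trim only the columns flagged as character columns; pass others through.
        f"TRIM({name}) AS {name}" if name in char_columns else name
        for name in columns
    ]
    return f"SELECT {', '.join(select_list)} FROM {table}"


# Hypothetical table and column names, for illustration only.
query = build_trimmed_select(
    "src.accounts", ["account_id", "amount"], {"account_id"}
)
print(query)
```

If the generated query already applies such a transformation, the padded values should not reach the write step, which is why the comment asks to inspect the query that actually gets built.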
Changes
What does this PR do?
Strip CHAR(n)/VARCHAR(n) length constraints from DataFrames before writing intermediate data to Delta during reconciliation. This prevents `DELTA_EXCEED_CHAR_VARCHAR_LIMIT` errors when source data contains space-padded CHAR values.
Root cause
Some data sources (e.g., Teradata) return CHAR(n) values with space padding via JDBC, resulting in values that exceed the declared column length (e.g., a CHAR(16) column returning 16 digits + 16 spaces = 32 characters). Delta enforces CHAR/VARCHAR length constraints through column metadata (`__CHAR_VARCHAR_TYPE_STRING`), causing writes to fail for these padded values.
This was observed with Teradata via Lakehouse Federation but not with Lakebase (PostgreSQL) via Lakehouse Federation.
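The arithmetic in the CHAR(16) example can be checked with a small plain-Python model. It assumes a strict length check that counts trailing spaces, as the description implies; `fits_declared_length` is a hypothetical stand-in and does not reproduce Delta's actual enforcement logic.

```python
def fits_declared_length(value: str, declared_length: int) -> bool:
    """Hypothetical strict check: the value may not exceed n characters."""
    return len(value) <= declared_length


declared_length = 16  # CHAR(16)
padded_value = "1234567890123456" + " " * 16  # 16 digits + 16 spaces

print(len(padded_value))  # 32
print(fits_declared_length(padded_value, declared_length))  # False

# Removing the trailing padding brings the value back within the limit.
trimmed_value = padded_value.rstrip(" ")
print(fits_declared_length(trimmed_value, declared_length))  # True
```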
Fix
Strip all column metadata via `col.alias(name, metadata={})` before writing intermediate DataFrames to Delta. This removes the constraint that Delta uses for length enforcement. The intermediate data is temporary and does not need metadata preservation.
Linked issues
Fixes #2389
Tests
Test plan
- `test_strip_char_varchar_constraints_strips_metadata`: verifies CHAR/VARCHAR metadata is stripped
- `test_strip_char_varchar_constraints_preserves_types`: verifies column types are preserved
Reopened from #2390 on an upstream branch to bypass the fork-PR OIDC restriction on JFrog auth (CI cannot run on fork PRs). All review comments and history are preserved on the original PR.
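The two properties named in the test plan (metadata stripped, types preserved) can be modeled without Spark. The sketch below represents a schema as a list of plain dictionaries, which is an assumption made for illustration; it is not the PySpark API, it does not verify how `col.alias(name, metadata={})` behaves at runtime, and `strip_field_metadata` and the field names are hypothetical.

```python
CHAR_VARCHAR_KEY = "__CHAR_VARCHAR_TYPE_STRING"


def strip_field_metadata(fields):
    """Return a copy of the schema model with every field's metadata emptied."""
    # Copy each field so the input schema is left untouched.
    return [{**field, "metadata": {}} for field in fields]


# Hypothetical schema model: one CHAR(16) column carrying the length
# metadata, and one column with no metadata at all.
schema = [
    {"name": "account_id", "type": "string",
     "metadata": {CHAR_VARCHAR_KEY: "char(16)"}},
    {"name": "amount", "type": "decimal(10,2)", "metadata": {}},
]

stripped = strip_field_metadata(schema)

for field in stripped:
    print(field["name"], field["type"], field["metadata"])
```

Names and types pass through unchanged while the metadata that carries the declared length is dropped, which mirrors what the two tests are described as checking.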